OCR exploration server

Warning: this site is under development!
Warning: this site was generated automatically from raw corpora.
The information has therefore not been validated.

OCR binarization and image pre-processing for searching historical documents

Internal identifier: 000F29 (Main/Exploration); previous: 000F28; next: 000F30

Authors: Maya R. Gupta [United States]; Nathaniel P. Jacobson [United States]; Eric K. Garcia [United States]

Source: Pattern recognition, 2007 (ISSN 0031-3203)

RBID : Pascal:07-0059470

Abstract

We consider the problem of document binarization as a pre-processing step for optical character recognition (OCR) for the purpose of keyword search of historical printed documents. A number of promising techniques from the literature for binarization, pre-filtering, and post-binarization denoising were implemented along with newly developed methods for binarization: an error diffusion binarization, a multiresolutional version of Otsu's binarization, and denoising by despeckling. The OCR in the ABBYY FineReader 7.1 SDK is used as a black box metric to compare methods. Results for 12 pages from six newspapers of differing quality show that performance varies widely by image, but that the classic Otsu method and Otsu-based methods perform best on average.
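
For readers unfamiliar with the classic Otsu method named in the abstract, the following is a minimal sketch of global Otsu thresholding in pure Python. It is not the authors' implementation and does not cover their multiresolution variant, error diffusion, or despeckling. The function names `otsu_threshold` and `binarize`, the flat list of 8-bit gray values, and the convention that values at or below the threshold become black are all assumptions made for illustration.

```python
def otsu_threshold(pixels, levels=256):
    """Return the gray level t that maximizes between-class variance.

    Pixels with value <= t form the dark class, pixels > t the light class.
    `pixels` is a flat iterable of integers in [0, levels).
    """
    pixels = list(pixels)
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1

    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0    # number of pixels in the dark class so far
    sum0 = 0  # sum of gray values in the dark class so far
    for t in range(levels):
        w0 += hist[t]
        sum0 += t * hist[t]
        if w0 == 0:
            continue          # dark class still empty
        w1 = total - w0
        if w1 == 0:
            break             # light class empty from here on
        mu0 = sum0 / w0
        mu1 = (sum_all - sum0) / w1
        between_var = w0 * w1 * (mu0 - mu1) ** 2
        if between_var > best_var:  # strict: keep the lowest maximizing t
            best_t, best_var = t, between_var
    return best_t


def binarize(pixels, t):
    """Map gray values to 0 (black, assumed text) or 255 (white, assumed background)."""
    return [0 if p <= t else 255 for p in pixels]


# Toy example: four dark "ink" pixels and four light "paper" pixels.
sample = [10, 12, 11, 13, 200, 205, 210, 198]
t = otsu_threshold(sample)
print(t, binarize(sample, t))
```

A single global threshold like this ignores uneven illumination and bleed-through, which may be one reason the abstract reports that performance varies widely by image.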



The document in XML format

<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en" level="a">OCR binarization and image pre-processing for searching historical documents</title>
<author>
<name sortKey="Gupta, Maya R" sort="Gupta, Maya R" uniqKey="Gupta M" first="Maya R." last="Gupta">Maya R. Gupta</name>
<affiliation wicri:level="4">
<inist:fA14 i1="01">
<s1>Electrical Engineering, University of Washington</s1>
<s2>Seattle, Washington 98195</s2>
<s3>USA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>États-Unis</country>
<placeName>
<settlement type="city">Seattle</settlement>
<region type="state">Washington (État)</region>
</placeName>
<orgName type="university">Université de Washington</orgName>
</affiliation>
</author>
<author>
<name sortKey="Jacobson, Nathaniel P" sort="Jacobson, Nathaniel P" uniqKey="Jacobson N" first="Nathaniel P." last="Jacobson">Nathaniel P. Jacobson</name>
<affiliation wicri:level="4">
<inist:fA14 i1="01">
<s1>Electrical Engineering, University of Washington</s1>
<s2>Seattle, Washington 98195</s2>
<s3>USA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>États-Unis</country>
<placeName>
<settlement type="city">Seattle</settlement>
<region type="state">Washington (État)</region>
</placeName>
<orgName type="university">Université de Washington</orgName>
</affiliation>
</author>
<author>
<name sortKey="Garcia, Eric K" sort="Garcia, Eric K" uniqKey="Garcia E" first="Eric K." last="Garcia">Eric K. Garcia</name>
<affiliation wicri:level="4">
<inist:fA14 i1="01">
<s1>Electrical Engineering, University of Washington</s1>
<s2>Seattle, Washington 98195</s2>
<s3>USA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>États-Unis</country>
<placeName>
<settlement type="city">Seattle</settlement>
<region type="state">Washington (État)</region>
</placeName>
<orgName type="university">Université de Washington</orgName>
</affiliation>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">INIST</idno>
<idno type="inist">07-0059470</idno>
<date when="2007">2007</date>
<idno type="stanalyst">PASCAL 07-0059470 INIST</idno>
<idno type="RBID">Pascal:07-0059470</idno>
<idno type="wicri:Area/PascalFrancis/Corpus">000355</idno>
<idno type="wicri:Area/PascalFrancis/Curation">000431</idno>
<idno type="wicri:Area/PascalFrancis/Checkpoint">000271</idno>
<idno type="wicri:doubleKey">0031-3203:2007:Gupta M:ocr:binarization:and</idno>
<idno type="wicri:Area/Main/Merge">000F42</idno>
<idno type="wicri:Area/Main/Curation">000F29</idno>
<idno type="wicri:Area/Main/Exploration">000F29</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title xml:lang="en" level="a">OCR binarization and image pre-processing for searching historical documents</title>
<author>
<name sortKey="Gupta, Maya R" sort="Gupta, Maya R" uniqKey="Gupta M" first="Maya R." last="Gupta">Maya R. Gupta</name>
<affiliation wicri:level="4">
<inist:fA14 i1="01">
<s1>Electrical Engineering, University of Washington</s1>
<s2>Seattle, Washington 98195</s2>
<s3>USA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>États-Unis</country>
<placeName>
<settlement type="city">Seattle</settlement>
<region type="state">Washington (État)</region>
</placeName>
<orgName type="university">Université de Washington</orgName>
</affiliation>
</author>
<author>
<name sortKey="Jacobson, Nathaniel P" sort="Jacobson, Nathaniel P" uniqKey="Jacobson N" first="Nathaniel P." last="Jacobson">Nathaniel P. Jacobson</name>
<affiliation wicri:level="4">
<inist:fA14 i1="01">
<s1>Electrical Engineering, University of Washington</s1>
<s2>Seattle, Washington 98195</s2>
<s3>USA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>États-Unis</country>
<placeName>
<settlement type="city">Seattle</settlement>
<region type="state">Washington (État)</region>
</placeName>
<orgName type="university">Université de Washington</orgName>
</affiliation>
</author>
<author>
<name sortKey="Garcia, Eric K" sort="Garcia, Eric K" uniqKey="Garcia E" first="Eric K." last="Garcia">Eric K. Garcia</name>
<affiliation wicri:level="4">
<inist:fA14 i1="01">
<s1>Electrical Engineering, University of Washington</s1>
<s2>Seattle, Washington 98195</s2>
<s3>USA</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>États-Unis</country>
<placeName>
<settlement type="city">Seattle</settlement>
<region type="state">Washington (État)</region>
</placeName>
<orgName type="university">Université de Washington</orgName>
</affiliation>
</author>
</analytic>
<series>
<title level="j" type="main">Pattern recognition</title>
<title level="j" type="abbreviated">Pattern recogn.</title>
<idno type="ISSN">0031-3203</idno>
<imprint>
<date when="2007">2007</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
<seriesStmt>
<title level="j" type="main">Pattern recognition</title>
<title level="j" type="abbreviated">Pattern recogn.</title>
<idno type="ISSN">0031-3203</idno>
</seriesStmt>
</fileDesc>
<profileDesc>
<textClass>
<keywords scheme="KwdEn" xml:lang="en">
<term>Binary image</term>
<term>Despeckling</term>
<term>Dithering</term>
<term>Error diffusion</term>
<term>Filtering</term>
<term>Image processing</term>
<term>Implementation</term>
<term>Keyword</term>
<term>Multiresolution analysis</term>
<term>Noise reduction</term>
<term>Optical character recognition</term>
<term>Pattern recognition</term>
<term>Performance evaluation</term>
<term>Printed document</term>
<term>Signal processing</term>
</keywords>
<keywords scheme="Pascal" xml:lang="fr">
<term>Reconnaissance optique caractère</term>
<term>Image binaire</term>
<term>Mot clé</term>
<term>Document imprimé</term>
<term>Filtrage</term>
<term>Réduction bruit</term>
<term>Implémentation</term>
<term>Diffusion d'erreur</term>
<term>Analyse multirésolution</term>
<term>Evaluation performance</term>
<term>Tramage</term>
<term>Reconnaissance forme</term>
<term>Traitement signal</term>
<term>Traitement image</term>
<term>Déchatoiement</term>
</keywords>
</textClass>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">We consider the problem of document binarization as a pre-processing step for optical character recognition (OCR) for the purpose of keyword search of historical printed documents. A number of promising techniques from the literature for binarization, pre-filtering, and post-binarization denoising were implemented along with newly developed methods for binarization: an error diffusion binarization, a multiresolutional version of Otsu's binarization, and denoising by despeckling. The OCR in the ABBYY FineReader 7.1 SDK is used as a black box metric to compare methods. Results for 12 pages from six newspapers of differing quality show that performance varies widely by image, but that the classic Otsu method and Otsu-based methods perform best on average.</div>
</front>
</TEI>
<affiliations>
<list>
<country>
<li>États-Unis</li>
</country>
<region>
<li>Washington (État)</li>
</region>
<settlement>
<li>Seattle</li>
</settlement>
<orgName>
<li>Université de Washington</li>
</orgName>
</list>
<tree>
<country name="États-Unis">
<region name="Washington (État)">
<name sortKey="Gupta, Maya R" sort="Gupta, Maya R" uniqKey="Gupta M" first="Maya R." last="Gupta">Maya R. Gupta</name>
</region>
<name sortKey="Garcia, Eric K" sort="Garcia, Eric K" uniqKey="Garcia E" first="Eric K." last="Garcia">Eric K. Garcia</name>
<name sortKey="Jacobson, Nathaniel P" sort="Jacobson, Nathaniel P" uniqKey="Jacobson N" first="Nathaniel P." last="Jacobson">Nathaniel P. Jacobson</name>
</country>
</tree>
</affiliations>
</record>
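
As a hedged illustration of how a record like the one above might be read programmatically, the sketch below uses Python's standard `xml.etree.ElementTree`. The record as displayed uses the `wicri:` and `inist:` prefixes without declaring them, which a strict XML parser rejects, so the sketch injects placeholder namespace declarations on the root element. The URIs, the helper names `parse_record` and `summarize`, and the shortened `SAMPLE` excerpt are assumptions for illustration only and are not part of Dilib or Wicri.

```python
import xml.etree.ElementTree as ET

# Placeholder namespace URIs (assumed): the displayed record declares none.
NS = {"wicri": "urn:example:wicri", "inist": "urn:example:inist"}


def parse_record(xml_text):
    """Parse a <record> after binding the undeclared prefixes to placeholders."""
    decl = " ".join(f'xmlns:{prefix}="{uri}"' for prefix, uri in NS.items())
    patched = xml_text.replace("<record>", f"<record {decl}>", 1)
    return ET.fromstring(patched)


def summarize(record):
    """Collect a few fields from the TEI header of the record."""
    return {
        "title": record.findtext(".//titleStmt/title"),
        "authors": [n.text for n in record.findall(".//titleStmt/author/name")],
        "rbid": record.findtext(".//idno[@type='RBID']"),
        "keywords": [
            t.text for t in record.findall(".//keywords[@scheme='KwdEn']/term")
        ],
    }


# Shortened excerpt of the record, kept only to make the sketch runnable.
SAMPLE = """<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en" level="a">OCR binarization and image pre-processing for searching historical documents</title>
<author>
<name sortKey="Gupta, Maya R" last="Gupta">Maya R. Gupta</name>
<affiliation wicri:level="4">
<inist:fA14 i1="01"><s3>USA</s3></inist:fA14>
</affiliation>
</author>
</titleStmt>
<publicationStmt>
<idno type="RBID">Pascal:07-0059470</idno>
</publicationStmt>
</fileDesc>
<profileDesc>
<textClass>
<keywords scheme="KwdEn" xml:lang="en">
<term>Binary image</term>
<term>Despeckling</term>
</keywords>
</textClass>
</profileDesc>
</teiHeader>
</TEI>
</record>"""

info = summarize(parse_record(SAMPLE))
print(info["title"])
print(info["authors"], info["rbid"], info["keywords"])
```

The unprefixed elements carry no namespace, so plain tag names work in the paths; only the prefixed elements and attributes end up under the placeholder URIs.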

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Ticri/CIDE/explor/OcrV1/Data/Main/Exploration
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000F29 | SxmlIndent | more

Or

HfdSelect -h $EXPLOR_AREA/Data/Main/Exploration/biblio.hfd -nk 000F29 | SxmlIndent | more

To link to this page within the Wicri network

{{Explor lien
   |wiki=    Ticri/CIDE
   |area=    OcrV1
   |flux=    Main
   |étape=   Exploration
   |type=    RBID
   |clé=     Pascal:07-0059470
   |texte=   OCR binarization and image pre-processing for searching historical documents
}}

This area was generated with Dilib version V0.6.32.
Data generation: Sat Nov 11 16:53:45 2017. Site generation: Mon Mar 11 23:15:16 2024